{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5bf1de44-4047-46cf-a04c-dbf910d9e179",
   "metadata": {},
   "source": [
    "# BM25 Retriever\n",
    "In this guide, we define a bm25 retriever that search documents using bm25 method.\n",
    "\n",
    "This notebook is very similar to the RouterQueryEngine notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e73fead-ec2c-4346-bd08-e183c13c7e29",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a2d59778-4cda-47b5-8cd0-b80fee91d1e4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# NOTE: This is ONLY necessary in jupyter notebook.\n",
    "# Details: Jupyter runs an event-loop behind the scenes.\n",
    "#          This results in nested event-loops when we start an event-loop to make async queries.\n",
    "#          This is normally not allowed, we use nest_asyncio to allow it for convenience.\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "04941476",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import openai\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
    "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c628448c-573c-4eeb-a7e1-707fe8cc575c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: NumExpr detected 16 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 8.\n",
      "NumExpr defaulting to 8 threads.\n"
     ]
    }
   ],
   "source": [
    "import logging\n",
    "import sys\n",
    "\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
    "logging.getLogger().handlers = []\n",
    "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "\n",
    "from llama_index import (\n",
    "    SimpleDirectoryReader,\n",
    "    ServiceContext,\n",
    "    StorageContext,\n",
    "    VectorStoreIndex,\n",
    ")\n",
    "from llama_index.retrievers import BM25Retriever\n",
    "from llama_index.indices.vector_store.retrievers.retriever import VectorIndexRetriever\n",
    "from llama_index.llms import OpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "787174ed-10ce-47d7-82fd-9ca7f891eea7",
   "metadata": {},
   "source": [
    "## Load Data\n",
    "\n",
    "We first show how to convert a Document into a set of Nodes, and insert into a DocumentStore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1fc1b8ac-bf55-4d60-841c-61698663322f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"../data/paul_graham\").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7081194a-ede7-478e-bff2-23e89e23ef16",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# initialize service context (set chunk size)\n",
    "llm = OpenAI(model=\"gpt-4\")\n",
    "service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)\n",
    "nodes = service_context.node_parser.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8f61bca2-c3b4-4ef0-a8f1-367933aa6d05",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# initialize storage context (by default it's in-memory)\n",
    "storage_context = StorageContext.from_defaults()\n",
    "storage_context.docstore.add_documents(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0d6162df-9da7-4aad-a2ca-eb318f67daec",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "index = VectorStoreIndex(\n",
    "    nodes=nodes,\n",
    "    storage_context=storage_context,\n",
    "    service_context=service_context,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fa4b18c-4f9b-4311-b1d9-9cbf24687e2c",
   "metadata": {},
   "source": [
    "## BM25 Retriever\n",
    "\n",
    "We will search document with bm25 retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5e94711d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install rank_bm25"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ee3f7c3b-69b4-48d5-bf22-ac51a4e3179f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "retriever = BM25Retriever.from_defaults(index, similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7b8c4c12-1a30-425e-8312-04be050b2101",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 880f345a-439c-49a0-ac49-55f2dbb61ee1<br>**Similarity:** 7.322087057741413<br>**Text:** This name didn't last long before it was replaced by \"software as a service,\" but it was current ...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** dfb3f58d-da85-4c90-b070-f817db51023c<br>**Similarity:** 7.121406119821665<br>**Text:** There, right on the wall, was something you could make that would last.Paintings didn't become ob...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.response.notebook_utils import display_source_node\n",
    "\n",
    "# will retrieve all context from the author's life\n",
    "nodes = retriever.retrieve(\n",
    "    \"Can you give me all the context regarding the author's life?\"\n",
    ")\n",
    "for node in nodes:\n",
    "    display_source_node(node)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "2749c34e-97c0-4bd5-8358-377a94b8b2d8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 450904cc-d483-4b66-90a2-c36608de5c4b<br>**Similarity:** 6.84581665569314<br>**Text:** That seemed unnatural to me, and on this point the rest of the world is coming around to my way o...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 56ef1b5a-5519-4e31-bedb-9d81ce80f0d8<br>**Similarity:** 6.259657209129812<br>**Text:** So I decided to take a shot at it.It took 4 years, from March 26, 2015 to October 12, 2019.It was...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "nodes = retriever.retrieve(\"What did Paul Graham do after RISD?\")\n",
    "for node in nodes:\n",
    "    display_source_node(node)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73460e63-3c7a-49f4-80d2-ea9f4033848e",
   "metadata": {},
   "source": [
    "## Router Retriever with bm25 method\n",
    "\n",
    "Now we will combine bm25 retriever with vector index retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5ceedab8-c9e8-4ec9-be91-cba27482a7e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.tools import RetrieverTool\n",
    "\n",
    "vector_retriever = VectorIndexRetriever(index)\n",
    "bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=2)\n",
    "\n",
    "retriever_tools = [\n",
    "    RetrieverTool.from_defaults(\n",
    "        retriever=vector_retriever,\n",
    "        description=\"Useful in most cases\",\n",
    "    ),\n",
    "    RetrieverTool.from_defaults(\n",
    "        retriever=bm25_retriever,\n",
    "        description=\"Useful if searching about specific information\",\n",
    "    ),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "dcb850f9-c8b4-4843-b66c-8cb2a977c0e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.retrievers import RouterRetriever\n",
    "\n",
    "retriever = RouterRetriever.from_defaults(\n",
    "    retriever_tools=retriever_tools,\n",
    "    service_context=service_context,\n",
    "    select_multi=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "f3e29b3e-38e3-4f0b-a263-d267ed003cc6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selecting retriever 0: The author's life context is a broad topic and would be useful in most cases..\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** dfb3f58d-da85-4c90-b070-f817db51023c<br>**Similarity:** 0.7839511925738453<br>**Text:** There, right on the wall, was something you could make that would last.Paintings didn't become ob...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 47b8287e-23e4-4c1a-b7f9-6db6d4c5786a<br>**Similarity:** 0.7816309032859696<br>**Text:** The students and faculty in the painting department at the Accademia were the nicest people you c...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# will retrieve all context from the author's life\n",
    "nodes = retriever.retrieve(\n",
    "    \"Can you give me all the context regarding the author's life?\"\n",
    ")\n",
    "for node in nodes:\n",
    "    display_source_node(node)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1b4b717",
   "metadata": {},
   "source": [
    "## Advanced - Hybrid Retriever + Re-Ranking\n",
    "\n",
    "Here we extend the base retriever class and create a custom retriever that always uses the vector retriever and BM25 retreiver.\n",
    "\n",
    "Then, nodes can be re-ranked and filtered. This lets us keep intermediate top-k values large and letting the re-ranking filter out un-needed nodes.\n",
    "\n",
    "To best demonstrate this, we will use a larger set of source documents -- Chapter 3 from the 2022 IPCC Climate Report."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4be4ec7",
   "metadata": {},
   "source": [
    "### Setup data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5ebd3ddf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 20.7M  100 20.7M    0     0  15.4M      0  0:00:01  0:00:01 --:--:-- 15.5M\n"
     ]
    }
   ],
   "source": [
    "!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f7b5a97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install pypdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "e46f93b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import (\n",
    "    VectorStoreIndex,\n",
    "    ServiceContext,\n",
    "    StorageContext,\n",
    "    SimpleDirectoryReader,\n",
    ")\n",
    "from llama_index.llms import OpenAI\n",
    "\n",
    "# load documents\n",
    "documents = SimpleDirectoryReader(\n",
    "    input_files=[\"IPCC_AR6_WGII_Chapter03.pdf\"]\n",
    ").load_data()\n",
    "\n",
    "# initialize service context (set chunk size)\n",
    "# -- here, we set a smaller chunk size, to allow for more effective re-ranking\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
    "service_context = ServiceContext.from_defaults(chunk_size=256, llm=llm)\n",
    "nodes = service_context.node_parser.get_nodes_from_documents(documents)\n",
    "\n",
    "# initialize storage context (by default it's in-memory)\n",
    "storage_context = StorageContext.from_defaults()\n",
    "storage_context.docstore.add_documents(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ccfc9879",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorStoreIndex(\n",
    "    nodes, storage_context=storage_context, service_context=service_context\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "0474be06",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.retrievers import BM25Retriever\n",
    "\n",
    "# retireve the top 10 most similar nodes using embeddings\n",
    "vector_retriever = index.as_retriever(similarity_top_k=10)\n",
    "\n",
    "# retireve the top 10 most similar nodes using bm25\n",
    "bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99e603a1",
   "metadata": {},
   "source": [
    "### Custom Retriever Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "917a3d85",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.retrievers import BaseRetriever\n",
    "\n",
    "\n",
    "class HybridRetriever(BaseRetriever):\n",
    "    def __init__(self, vector_retriever, bm25_retriever):\n",
    "        self.vector_retriever = vector_retriever\n",
    "        self.bm25_retriever = bm25_retriever\n",
    "\n",
    "    def _retrieve(self, query, **kwargs):\n",
    "        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)\n",
    "        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)\n",
    "\n",
    "        # combine the two lists of nodes\n",
    "        all_nodes = []\n",
    "        node_ids = set()\n",
    "        for n in bm25_nodes + vector_nodes:\n",
    "            if n.node.node_id not in node_ids:\n",
    "                all_nodes.append(n)\n",
    "                node_ids.add(n.node.node_id)\n",
    "        return all_nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "9495e47c-2008-41ac-87e8-582d6d798190",
   "metadata": {},
   "outputs": [],
   "source": [
    "index.as_retriever(similarity_top_k=5)\n",
    "\n",
    "hybrid_retriever = HybridRetriever(vector_retriever, bm25_retriever)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfd3b287",
   "metadata": {},
   "source": [
    "### Re-Ranker Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "638e7786",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/loganm/miniconda3/envs/llama-index/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Use pytorch device: cpu\n"
     ]
    }
   ],
   "source": [
    "from llama_index.indices.postprocessor import SentenceTransformerRerank\n",
    "\n",
    "reranker = SentenceTransformerRerank(\n",
    "    top_n=4, model=\"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a78eec",
   "metadata": {},
   "source": [
    "### Retrieve"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "9235f79d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 1/1 [00:00<00:00,  1.16it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initial retrieval:  19  nodes\n",
      "Re-ranked retrieval:  4  nodes\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from llama_index import QueryBundle\n",
    "\n",
    "nodes = hybrid_retriever.retrieve(\"What is the impact of climate change on the ocean?\")\n",
    "reranked_nodes = reranker.postprocess_nodes(\n",
    "    nodes,\n",
    "    query_bundle=QueryBundle(\"What is the impact of climate change on the ocean?\"),\n",
    ")\n",
    "\n",
    "print(\"Initial retrieval: \", len(nodes), \" nodes\")\n",
    "print(\"Re-ranked retrieval: \", len(reranked_nodes), \" nodes\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "5674f1d6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** f297871f-603e-4192-b289-f724e96ccff8<br>**Similarity:** 6.131191253662109<br>**Text:** Observations: vulnerabilities and impacts\n",
       "Anthropogenic climate change has exposed ocean and coas...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 61eb5022-6988-4e2a-98ab-3dc58bddde6a<br>**Similarity:** 6.01539945602417<br>**Text:** 3\n",
       "469Oceans and Coastal Ecosystems and Their Services  Chapter 3\n",
       "Frequently Asked Questions\n",
       "FAQ 3...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 7ddaf395-f5f2-4593-b4fa-d8e82b57696c<br>**Similarity:** 4.70263671875<br>**Text:** {Box 3.2, 3.2.2.1, 3.4.2.5, 3.4.2.10, 3.4.3.3, Cross-Chapter \n",
       "Box PALEO in Chapter 1}\n",
       "Climate imp...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 574fd2e2-dd52-45a7-bb40-ca25c9b2cc00<br>**Similarity:** 3.768509864807129<br>**Text:** In both polar regions, climate-induced changes in ocean and sea ice conditions have \n",
       "expanded the...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.response.notebook_utils import display_source_node\n",
    "\n",
    "for node in reranked_nodes:\n",
    "    display_source_node(node)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24caa0d",
   "metadata": {},
   "source": [
    "### Full Query Engine "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "549b9a96",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 1/1 [00:00<00:00,  1.25it/s]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.query_engine import RetrieverQueryEngine\n",
    "\n",
    "query_engine = RetrieverQueryEngine.from_args(\n",
    "    retriever=hybrid_retriever,\n",
    "    node_postprocessors=[reranker],\n",
    "    service_context=service_context,\n",
    ")\n",
    "\n",
    "response = query_engine.query(\"What is the impact of climate change on the ocean?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "171668b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**`Final Response:`** Climate change has greatly impacted life in the ocean and along its coasts. It has exposed ocean and coastal ecosystems to conditions that are unprecedented over millennia. Fundamental changes in the physical and chemical characteristics of the ocean are changing the timing, distribution, and abundance of oceanic and coastal organisms. These changes have been observed through multi-decadal observations, laboratory studies, and meta-analyses of published data. Additionally, climate change is degrading ocean health, altering stocks of marine resources, and threatening the sustenance provided to Indigenous Peoples, livelihoods of artisanal fisheries, and marine-based industries such as tourism, shipping, and transportation. The impacts of climate change on the ocean can influence human activities and employment by altering resource availability, spreading pathogens, flooding shorelines, and degrading ocean ecosystems."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.response.notebook_utils import display_response\n",
    "\n",
    "display_response(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39ac4767",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index",
   "language": "python",
   "name": "llama-index"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
